
    Genome-wide inference of ancestral recombination graphs

    The complex correlation structure of a collection of orthologous DNA sequences is uniquely captured by the "ancestral recombination graph" (ARG), a complete record of coalescence and recombination events in the history of the sample. However, existing methods for ARG inference are computationally intensive, highly approximate, or limited to small numbers of sequences, and, as a consequence, explicit ARG inference is rarely used in applied population genomics. Here, we introduce a new algorithm for ARG inference that is efficient enough to apply to dozens of complete mammalian genomes. The key idea of our approach is to sample an ARG of n chromosomes conditional on an ARG of n-1 chromosomes, an operation we call "threading." Using techniques based on hidden Markov models, we can perform this threading operation exactly, up to the assumptions of the sequentially Markov coalescent and a discretization of time. An extension allows for threading of subtrees instead of individual sequences. Repeated application of these threading operations results in highly efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these methods in a computer program called ARGweaver. Experiments with simulated data indicate that ARGweaver converges rapidly to the true posterior distribution and is effective in recovering various features of the ARG for dozens of sequences generated under realistic parameters for human populations. In applications of ARGweaver to 54 human genome sequences from Complete Genomics, we find clear signatures of natural selection, including regions of unusually ancient ancestry associated with balancing selection and reductions in allele age in sites under directional selection. Preliminary results also indicate that our methods can be used to gain insight into complex features of human population structure, even with a noninformative prior distribution.
    Comment: 88 pages, 7 main figures, 22 supplementary figures. This version contains a substantially expanded genomic data analysis

    How Much Information is Provided by Human Epigenomic Data? An Evolutionary View

    Here, we ask the question, “How much information do available epigenomic data sets provide about human genomic function, individually or in combination?” We consider nine epigenomic and annotation features across 115 cell types and measure genomic function by using signatures of natural selection as a proxy. We measure information as the reduction in entropy under a probabilistic evolutionary model that describes genetic variation across ∼50 diverse humans and several nonhuman primates. We find that several genomic features yield more information in combination than they do individually, with DNase-seq displaying particularly strong synergy. Most of the entropy in human genetic variation, by far, reflects mutation and neutral drift; the genome-wide reduction in entropy due to selection is equivalent to only a small fraction of the storage requirements of a single human genome. Based on this framework, we produce cell-type-specific maps of the probability that a mutation at each nucleotide will have fitness consequences (FitCons scores). These scores are predictive of known functional elements and disease-associated variants, they reveal relationships among cell types, and they suggest that ∼8% of nucleotide sites are constrained by natural selection.
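As a rough illustration of "information as the reduction in entropy", the sketch below computes H(X) - H(X | F) from labeled toy observations. It is not the probabilistic evolutionary model used in the paper: the function names, the feature labels ("open", "closed"), and the site states ("conserved", "variable") are all hypothetical and chosen only to show the arithmetic.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of the empirical distribution given by counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_reduction(pairs):
    """Return H(X) - H(X | F) for a list of (feature_label, site_state) pairs.

    Hypothetical helper: the feature label stands in for an epigenomic
    annotation and the site state for a pattern of genetic variation.
    """
    # Marginal entropy of the site states, ignoring the feature.
    state_counts = Counter(state for _, state in pairs)
    h_marginal = entropy(state_counts.values())

    # Conditional entropy: entropy within each feature class, weighted by class size.
    by_feature = {}
    for feature, state in pairs:
        by_feature.setdefault(feature, Counter())[state] += 1
    n = len(pairs)
    h_conditional = sum(
        (sum(c.values()) / n) * entropy(c.values()) for c in by_feature.values()
    )
    return h_marginal - h_conditional


# Toy data (invented): the feature perfectly predicts the site state.
informative = [("open", "conserved")] * 4 + [("closed", "variable")] * 4
# Toy data (invented): the feature says nothing about the site state.
uninformative = [
    ("open", "conserved"), ("open", "variable"),
    ("closed", "conserved"), ("closed", "variable"),
]

gain_informative = entropy_reduction(informative)
gain_uninformative = entropy_reduction(uninformative)
```

In this toy setting a perfectly predictive feature removes the full one bit of uncertainty, while an unrelated feature removes none.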

    Error and Error Mitigation in Low-Coverage Genome Assemblies

    The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ~2× coverage. Here we examine the extent of sequencing error in these 2× assemblies, and its potential impact in downstream analyses. By comparing 2× assemblies with high-quality sequences from the ENCODE regions, we estimate the rate of sequencing error to be 1–4 errors per kilobase. While this error rate is fairly modest, sequencing error can still have surprising effects. For example, an apparent lineage-specific insertion in a coding region is more likely to reflect sequencing error than a true biological event, and the length distribution of coding indels is strongly distorted by error. We find that most errors are contributed by a small fraction of bases with low quality scores, in particular, by the ends of reads in regions of single-read coverage in the assembly. We explore several approaches for automatic sequencing error mitigation (SEM), making use of the localized nature of sequencing error, the fact that it is well predicted by quality scores, and information about errors that comes from comparisons across species. Our automatic methods for error mitigation cannot replace the need for additional sequencing, but they do allow substantial fractions of errors to be masked or eliminated at the cost of modest amounts of over-correction, and they can reduce the impact of error in downstream phylogenomic analyses. Our error-mitigated alignments are available for download.
    Funding: National Science Foundation (U.S.) (Faculty Early Career Development grant DBI-0644111); National Science Foundation (U.S.) (Faculty Early Career Development grant DBI-0644282); National Science Foundation (U.S.) (Faculty Early Career Development grant U54 HG004555-01); David & Lucile Packard Foundation (Fellowship for Science and Engineering)

    Mapping gene flow between ancient hominins through demography-aware inference of the ancestral recombination graph

    The sequencing of Neanderthal and Denisovan genomes has yielded many new insights about interbreeding events between extinct hominins and the ancestors of modern humans. While much attention has been paid to the relatively recent gene flow from Neanderthals and Denisovans into modern humans, other instances of introgression leave more subtle genomic evidence and have received less attention. Here, we present an extended version of the ARGweaver algorithm, ARGweaver-D, which can infer local genetic relationships under a user-defined demographic model that includes population splits and migration events. This Bayesian algorithm probabilistically samples ancestral recombination graphs (ARGs) that specify not only tree topology and branch lengths along the genome, but also indicate migrant lineages. The sampled ARGs can therefore be parsed to produce probabilities of introgression along the genome. We show that this method is well powered to detect the archaic migration into modern humans, even with only a few samples. We then show that the method can also detect introgressed regions stemming from older migration events, or from unsampled populations. We apply it to human, Neanderthal, and Denisovan genomes, looking for signatures of older proposed migration events, including ancient humans into Neanderthal, and unknown archaic hominins into Denisovans. We identify 3% of the Neanderthal genome that is putatively introgressed from ancient humans, and estimate that the gene flow occurred between 200 and 300 kya. We find no convincing evidence that negative selection acted against these regions. We also identify 1% of the Denisovan genome that was likely introgressed from an unsequenced hominin ancestor, and note that 15% of these regions have been passed on to modern humans through subsequent gene flow.
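The step of parsing sampled ARGs into probabilities of introgression along the genome can be pictured as a per-position frequency over MCMC samples. The sketch below assumes, purely for illustration, that each sampled ARG has already been reduced to a list of half-open (start, end) intervals in which a lineage is flagged as migrant; this input format and the function name are hypothetical and do not reflect ARGweaver-D's actual output files.

```python
def introgression_probability(samples, length):
    """Fraction of sampled ARGs in which each position is flagged as migrant.

    samples: one entry per sampled ARG; each entry is a list of half-open
             (start, end) intervals flagged as introgressed (hypothetical format).
    length:  number of positions in the region.
    """
    counts = [0] * length
    for intervals in samples:
        # Mark positions once per sample, even if intervals overlap.
        flagged = [False] * length
        for start, end in intervals:
            for pos in range(max(start, 0), min(end, length)):
                flagged[pos] = True
        for pos, is_migrant in enumerate(flagged):
            counts[pos] += is_migrant
    return [c / len(samples) for c in counts]


# Four invented samples over an 8-position region.
samples = [[(2, 5)], [(3, 6)], [], [(2, 4), (3, 5)]]
probs = introgression_probability(samples, 8)
# probs == [0.0, 0.0, 0.5, 0.75, 0.75, 0.25, 0.0, 0.0]
```

A threshold on these per-position values (for example, calling a region introgressed where the value exceeds some cutoff) would then be a separate analysis choice.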

    ACE: A Probabilistic Model for Characterizing Gene-Level Essentiality in CRISPR Screens

    High-throughput knockout screens based on CRISPR-Cas9 are widely used to evaluate the essentiality of genes across a range of cell types. Here we introduce a probabilistic modeling framework, Analysis of CRISPR-based Essentiality (ACE), that enables new statistical tests for essentiality based on the raw sequence read counts from such screens. ACE estimates the essentiality of each gene using a flexible likelihood framework that accounts for multiple sources of variation in the CRISPR-Cas9 experimental process. In addition, the method can identify genes that differ in their degree of essentiality across samples using a likelihood ratio test. We show using simulations that ACE is competitive with the best available methods in predicting essentiality, and is especially useful for the identification of differential essentiality. Furthermore, by applying ACE to publicly available CRISPR-screen data, we are able to identify both known and previously overlooked candidates for genotype-specific essentiality, including RNA m6-A methyltransferases that exhibit enhanced essentiality in the presence of inactivating TP53 mutations. In summary, ACE provides improved quantification of essentiality specific to cancer subtypes, and a robust probabilistic framework for identifying genes responsive to therapeutic targeting.
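To make the idea of a likelihood ratio test on read counts concrete, here is a deliberately simplified sketch. It swaps in a plain Poisson model with one rate per sample group, which is an assumption made for illustration and not ACE's likelihood; the function names and the toy counts are hypothetical. The null hypothesis is a single shared rate, the alternative allows each group its own rate, and the statistic is compared to a chi-square distribution with one degree of freedom.

```python
import math


def poisson_loglik(counts, rate):
    """Poisson log-likelihood up to the log(k!) term, which cancels in the ratio."""
    return sum(k * math.log(rate) - rate for k in counts)


def likelihood_ratio_test(counts_a, counts_b):
    """Test whether two groups of read counts share one Poisson rate.

    Assumes each group has a positive mean count. Returns (statistic, p_value).
    """
    pooled = list(counts_a) + list(counts_b)
    rate_null = sum(pooled) / len(pooled)      # MLE under a shared rate
    rate_a = sum(counts_a) / len(counts_a)     # MLEs under separate rates
    rate_b = sum(counts_b) / len(counts_b)

    ll_null = poisson_loglik(pooled, rate_null)
    ll_alt = poisson_loglik(counts_a, rate_a) + poisson_loglik(counts_b, rate_b)

    # Clamp at zero to absorb floating-point noise when the fits coincide.
    statistic = max(0.0, 2.0 * (ll_alt - ll_null))
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Invented counts: the second group shows strong depletion relative to the first.
stat_diff, p_diff = likelihood_ratio_test([100, 110, 90], [20, 25, 15])
# Invented counts: both groups look the same.
stat_same, p_same = likelihood_ratio_test([100, 110, 90], [100, 110, 90])
```

Real screen data are typically overdispersed relative to a Poisson model, so this sketch only shows the shape of the test, not a recommended analysis.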

    ACE: a probabilistic model for characterizing gene-level essentiality in CRISPR screens.

    High-throughput CRISPR-Cas9 knockout screens are widely used to evaluate gene essentiality in cancer research. Here we introduce a probabilistic modeling framework, Analysis of CRISPR-based Essentiality (ACE), that accounts for multiple sources of variation in CRISPR-Cas9 screens and enables new statistical tests for essentiality. We show using simulations that ACE is effective at predicting both absolute and differential essentiality. When applied to publicly available data, ACE identifies known and novel candidates for genotype-specific essentiality, including RNA m6-A methyltransferases that exhibit enhanced essentiality in the presence of inactivating TP53 mutations. ACE provides a robust framework for identifying genes responsive to subtype-specific therapeutic targeting.

    Selective sweeps on different pigmentation genes mediate convergent evolution of island melanism in two incipient bird species

    Insular organisms often evolve predictable phenotypes, like flightlessness, extreme body sizes, or increased melanin deposition. The evolutionary forces and molecular targets mediating these patterns remain mostly unknown. Here we study the Chestnut-bellied Monarch (Monarcha castaneiventris) from the Solomon Islands, a complex of closely related subspecies in the early stages of speciation. On the large island of Makira, M. c. megarhynchus has a chestnut belly, whereas on the small satellite islands of Ugi and of Santa Ana and Santa Catalina (SA/SC), M. c. ugiensis is entirely iridescent blue-black (i.e., melanic). Melanism has likely evolved twice, as the Ugi and SA/SC populations were established independently. To investigate the genetic basis of melanism on each island, we generated whole genome sequence data from all three populations. Non-synonymous mutations at the MC1R pigmentation gene are associated with melanism on SA/SC, while ASIP, an antagonistic ligand of MC1R, is associated with melanism on Ugi. Both genes show evidence of selective sweeps in traditional summary statistics and statistics derived from the ancestral recombination graph (ARG). Using the ARG in combination with machine learning, we inferred selection strength, timing of onset and allele frequency trajectories. MC1R shows evidence of a recent, strong, soft selective sweep. The region including ASIP shows more complex signatures; however, we find evidence for sweeps in mutations near ASIP, which are comparatively older than those on MC1R and have been under relatively strong selection. Overall, our study shows that convergent melanism results from selective sweeps at independent molecular targets, evolving in taxa where coloration likely mediates reproductive isolation with the neighboring chestnut-bellied subspecies.